Automatically Regenerating Wrappers for Web Sources Using Results from Previous Queries
نویسندگان
چکیده
A substantial subset of the web data follows some kind of underlying structure. Nevertheless, HTML does not contain any schema or semantic information about the data it represents. A program able to provide software applications with a structured view of those semi-structured web sources is usually called a wrapper. Wrappers are able to accept a query against the source and return a set of structured results, thus enabling applications to access web data in a similar manner to that of information from databases. A significant problem in this approach arises because web sources may experiment changes that invalidate the current wrappers. In this paper, we present novel heuristics and algorithms to address this problem. Our approach is based on collecting some query results during wrapper operation. Then, when the source changes, they are used to generate a set of labeled examples that are provided as input to a wrapper induction algorithm able to regenerate the wrapper. We have tested our methods in several real-world web data extraction domains, obtaining high accuracy in all the steps of the process.
منابع مشابه
Maintaining Web Navigation Flows for Wrappers
A substantial subset of the web data follows some kind of underlying structure. In order to let software programs gain full benefit from these “semistructured” web sources, wrapper programs are built to provide a “machinereadable” view over them. A significant problem with wrappers is that, since web sources are autonomous, they may experience changes that invalidate the current wrapper, so aut...
متن کاملWrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wr...
متن کاملQuery planning in mediator based information systems
Information integration has gained new importance since the widespread success of the World Wide Web. The simplicity of data publishing on the web and the promises of the emerging eCommerce markets pose a strong incentive for data providers to offer their services on the Internet. Due to the exponential growth rate of the number of web sites, users are already faced with an overwhelming amount ...
متن کاملWrapper Generation for Web Accessible Data Sources
There is an increase in the number of data sources that can be queried across the WWW. Such sources typically support HTML forms-based interfaces and search engines query collections of suitably indexed data. The data is displayed via a browser. One drawback to these sources is that there is no standard programming interface suitable for applications to submit queries. Second, the output (answe...
متن کاملSemi-Automatic Wrapper Generation for Internet Information Sources
To simplify the task of obtaining information from the vast number of information sources that are available on the World Wide Web (WWW), we are building tools to build information mediators for extracting and integrating data from multiple Web sources. In a mediator based approach, wrappers are built around individual information sources, that provide translation between the mediator query lan...
متن کامل